- Head of Research Group: Dr. Balázs LIGETI
- Members of the Group: Babett BODNÁR, Bendegúz FILYÓ, Dr. János JUHÁSZ, Judit JUHÁSZ, Dániel KRIZSÁN, Márton RÉTI
- Contact: ligeti.balazs@itk.ppke.hu
- Our research group focuses on large genomic and evolutionary context-aware neural networks and sequence representations. A fundamental question in quantitative biology is how to uncover novel patterns and structures in biological data, which is crucial for modeling, predicting, and manipulating complex systems such as the microbiome. Our most recent research focuses on understanding the complex relationships that characterize the microbiome, such as phage-bacteria interactions. Phages, the viruses of bacteria, can influence the structure of the microbiome and could serve as therapeutics as well as biomarkers.
- We designed and implemented a genomic language model, ProkBERT (Ligeti et al. 2024), to solve such bioinformatics tasks. ProkBERT provides a reusable, neural-network-based representation that can be applied to classification, regression, or clustering tasks related to the microbiome. The main advantage of the approach is that the model operates directly on nucleotide sequences, as opposed to traditional machine learning methods, which require tabular data produced by a complicated bioinformatics pipeline. It is widely adaptable and generalizes well, e.g. providing high-quality predictions on unseen data. It is also compact, fast, easy to use, and computationally efficient.
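The workflow described above (raw nucleotide sequence in, fixed-length representation out, simple downstream model on top) can be illustrated with a minimal sketch. This is not ProkBERT or its API: the `embed` function below is a hypothetical stand-in that uses normalized k-mer frequencies in place of a learned neural representation, and the k-mer size, function names, and toy sequences are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from typing import Dict, List
import math

K = 3  # hypothetical k-mer size, chosen only for this illustration
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 64 possible 3-mers


def embed(sequence: str) -> List[float]:
    """Stand-in for a learned sequence representation.

    Maps a raw nucleotide string to a fixed-length vector of normalized
    k-mer frequencies. A genomic language model would return a learned
    embedding here instead.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit(labelled: Dict[str, List[str]]) -> Dict[str, List[float]]:
    """Train a nearest-centroid classifier on top of the frozen representation."""
    return {label: centroid([embed(s) for s in seqs])
            for label, seqs in labelled.items()}


def predict(model: Dict[str, List[float]], sequence: str) -> str:
    """Return the label whose centroid is closest to the sequence embedding."""
    vector = embed(sequence)
    return min(model, key=lambda label: math.dist(vector, model[label]))


# Toy, made-up training data: no tabular feature pipeline is needed,
# the classifier sees only vectors derived directly from the sequences.
toy_training_set = {
    "at_rich": ["ATATATATAT", "AATTAATTAA"],
    "gc_rich": ["GCGCGCGCGC", "GGCCGGCCGG"],
}
toy_model = fit(toy_training_set)
toy_prediction = predict(toy_model, "TATATTATAA")
```

The point of the sketch is the division of labor: the representation step is reusable across tasks, and only the small model on top (here a nearest-centroid classifier) changes between classification, regression, or clustering.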
Figure i) ProkBERT operates directly on genomic data. ProkBERT was trained on large corpora of microbial sequence data (bacterial, viral, archaeal and fungal). It enables transfer learning by providing reusable sequence representations. The model is well suited to classification, clustering and regression problems.
Figure ii) 2D representations of different genomic features of ESKAPE pathogens. The sequences cluster by genomic structure, coding (blue) vs. non-coding (orange) regions (figure ii/a), as well as by phylogeny (figure ii/b). Although no annotation information was used to train the model, it captured the genomic structure.